在线持续学习(OCL)旨在通过单个通过数据从非平稳数据流进行逐步训练神经网络。基于彩排的方法试图用少量的内存近似观察到的输入分布,并以后重新审视它们以避免忘记。尽管具有强烈的经验表现,但排练方法仍然遭受了过去数据损失景观和记忆样本的差异。本文重新讨论了在线设置中的排练动态。我们从偏见和动态的经验风险最小化的角度从固有的内存过度拟合风险中提供了理论见解,并检查重复排练的优点和限制。受我们的分析的启发,一个简单而直观的基线,重复的增强彩排(RAR)旨在解决在线彩排的拟合不足的困境。令人惊讶的是,在四个相当不同的OCL基准测试中,这种简单的基线表现优于香草排练9%-17%,并且显着改善了基于最新的彩排方法miR,ASER和SCR。我们还证明,RAR成功地实现了过去数据的损失格局和其学习轨迹中的高损失山脊厌恶的准确近似。进行了广泛的消融研究,以研究重复和增强彩排和增强学习(RL)之间的相互作用(RL),以动态调整RAR的超参数以平衡在线稳定性 - 塑性权衡折衷。
translated by 谷歌翻译
最近,自我监督的预训练在W.R.T.的各种任务上具有先进的视觉变压器。不同的数据模式,例如图像和3D点云数据。在本文中,我们探讨了基于变压器的3D网格数据分析的学习范式。由于将变压器体系结构应用于新模式通常是非平凡的,因此我们首先将视觉变压器适应3D网格数据处理,即网格变压器。具体而言,我们将网格分为几个非重叠的本地贴片,每个贴片包含相同数量的面部,并使用每个贴片中心点的3D位置形成位置嵌入。受MAE的启发,我们探讨了如何使用基于变压器的结构对3D网格数据进行预训练如何使下游3D网格分析任务受益。我们首先随机掩盖网格的一些补丁,并将损坏的网格馈入网格变形金刚。然后,通过重建蒙版补丁的信息,该网络能够学习网格数据的区分表示。因此,我们命名我们的方法meshmae,可以在网格分析任务(即分类和分割)上产生最先进或可比性的性能。此外,我们还进行了全面的消融研究,以显示我们方法中关键设计的有效性。
translated by 谷歌翻译
具有大尺度图像文本对的视觉预训练(VLP)在各个领域都表现出卓越的性能。但是,Internet上的图像文本对共存通常缺乏明确的对齐信息,这对于VLP来说是次优的。建议采用现成的对象检测器来利用其他图像标签信息。但是,对象检测器是耗时的,只能识别预定义的对象类别,从而限制了模型容量。受到观察的启发,即文本包含不完整的细粒图像信息,我们介绍了Ideas,该想法代表通过在线多标签识别VLP来增加文本多样性。想法表明,可以在VLP期间共同优化从文本中提取的图像标签的多标签学习。此外,想法可以在线识别有价值的图像标签,以提供更明确的文本监督。全面的实验表明,想法可以显着提高多个下游数据集上的性能,并具有较小的额外计算成本。
translated by 谷歌翻译
在缺少标签(MLML)的情况下,多标签学习是一个具有挑战性的问题。现有方法主要关注网络结构或培训方案的设计,这提高了实现的复杂性。这项工作旨在满足MLML中的损失函数的潜力,而不增加程序和复杂性。为此,我们通过鲁棒损失设计提出了两种简单但有效的方法,基于观察到模型可以在高精度训练期间识别丢失的标签。首先是对底层的良好损失,即山损,重量底部以山的形状重量否定,以减轻虚假底片的效果。第二个是自定步损耗校正(SPLC)方法,其利用缺失标签的近似分布下的最大似然标准导出的丢失。在各种多标签图像分类数据集上的综合实验表明,我们的方法可以显着提高MLML的性能,并在MLML中实现新的最先进的损失函数。
translated by 谷歌翻译
在自我监督对比度学习范式下,小型模型表现得很差。现有方法通常采用大型现成模型,通过蒸馏将知识转移到小型。尽管有效率,但由于部署大型模型的巨大计算费用,蒸馏基方法可能不适合某些资源限制方案。在本文中,我们研究了没有蒸馏信号的自我监督小型模型的问题。我们首先评估小型模型的代表空间,并进行两个不可忽略的观察:(i)小型型号可以完成借口任务,而无需过度拟合,尽管它们有限,并且(ii)他们普遍遭受聚类问题的问题。然后我们验证了多个被认为减轻过分聚类现象的假设。最后,我们结合了验证的技术,提高了五种小型架构的基线性能,具有相当大的边缘,这表明即使没有蒸馏信号,培训小自我监督的对比模型也是可行的。该代码可在\ texit {https://github.com/wodeice/sl-small}中获得。
translated by 谷歌翻译
Detecting sarcasm and verbal irony from people's subjective statements is crucial to understanding their intended meanings and real sentiments and positions in social scenarios. This paper describes the X-PuDu system that participated in SemEval-2022 Task 6, iSarcasmEval - Intended Sarcasm Detection in English and Arabic, which aims at detecting intended sarcasm in various settings of natural language understanding. Our solution finetunes pre-trained language models, such as ERNIE-M and DeBERTa, under the multilingual settings to recognize the irony from Arabic and English texts. Our system ranked second out of 43, and ninth out of 32 in Task A: one-sentence detection in English and Arabic; fifth out of 22 in Task B: binary multi-label classification in English; first out of 16, and fifth out of 13 in Task C: sentence-pair detection in English and Arabic.
translated by 谷歌翻译
Recent cross-lingual cross-modal works attempt to extend Vision-Language Pre-training (VLP) models to non-English inputs and achieve impressive performance. However, these models focus only on understanding tasks utilizing encoder-only architecture. In this paper, we propose ERNIE-UniX2, a unified cross-lingual cross-modal pre-training framework for both generation and understanding tasks. ERNIE-UniX2 integrates multiple pre-training paradigms (e.g., contrastive learning and language modeling) based on encoder-decoder architecture and attempts to learn a better joint representation across languages and modalities. Furthermore, ERNIE-UniX2 can be seamlessly fine-tuned for varieties of generation and understanding downstream tasks. Pre-trained on both multilingual text-only and image-text datasets, ERNIE-UniX2 achieves SOTA results on various cross-lingual cross-modal generation and understanding tasks such as multimodal machine translation and multilingual visual question answering.
translated by 谷歌翻译
混合样品正则化(MSR),例如混合或cutmix,是一种强大的数据增强策略,可以推广卷积神经网络。先前的经验分析说明了MSR与传统的离线知识蒸馏(KD)之间的正交性能增长。更具体地说,可以通过MSR参与顺序蒸馏的训练阶段来增强学生网络。然而,MSR和在线知识蒸馏之间的相互作用,这是一个更强的蒸馏范式,在那里,一群同伴互相学习的合奏仍然没有探索。为了弥合差距,我们首次尝试将cutmix纳入在线蒸馏中,我们从经验上观察到了重大改进。在这个事实的鼓舞下,我们提出了一个更强大的MSR,专门用于在线蒸馏,称为Cut^nMix。此外,一个新颖的在线蒸馏框架是在切割^nmix上设计的,以通过功能水平相互学习和自我启动的老师来增强蒸馏。对CIFAR10和CIFAR100进行六个网络体系结构的全面评估表明,我们的方法可以始终超过最先进的蒸馏方法。
translated by 谷歌翻译
Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs). However, small models that are critical for real-world applications cannot or only marginally benefit from this pre-training approach. In this paper, we explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones. We systematically study different options in the distillation framework, including distilling targets, losses, input, network regularization, sequential distillation, etc, revealing that: 1) Distilling token relations is more effective than CLS token- and feature-based distillation; 2) An intermediate layer of the teacher network as target perform better than that using the last layer when the depth of the student mismatches that of the teacher; 3) Weak regularization is preferred; etc. With these findings, we achieve significant fine-tuning accuracy improvements over the scratch MIM pre-training on ImageNet-1K classification, using all the ViT-Tiny, ViT-Small, and ViT-base models, with +4.2%/+2.4%/+1.4% gains, respectively. Our TinyMIM model of base size achieves 52.2 mIoU in AE20K semantic segmentation, which is +4.1 higher than the MAE baseline. Our TinyMIM model of tiny size achieves 79.6% top-1 accuracy on ImageNet-1K image classification, which sets a new record for small vision models of the same size and computation budget. This strong performance suggests an alternative way for developing small vision Transformer models, that is, by exploring better training methods rather than introducing inductive biases into architectures as in most previous works. Code is available at https://github.com/OliverRensu/TinyMIM.
translated by 谷歌翻译
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes the image and point clouds tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly, by encoding the 3D points into multi-modal features. The core design of CMT is quite simple while its performance is impressive. CMT obtains 73.0% NDS on nuScenes benchmark. Moreover, CMT has a strong robustness even if the LiDAR is missing. Code will be released at https://github.com/junjie18/CMT.
translated by 谷歌翻译